1 Large Language Models (LLMs)

Table 1.1: LLMs
provider model estimate class
1 anthropic claude-3-7-sonnet-20250219 3.8580848 top
2 anthropic claude-3-5-sonnet-20241022 3.4210271 top
3 xai grok-3-beta 3.0488472 top
4 anthropic claude-3-haiku-20240307 0.3764656 bottom
5 cohere command-r-08-2024 0.3764656 bottom
6 openai gpt-3.5-turbo 0.3299676 bottom
7 openai gpt-4o-mini 0.2865677 bottom
8 google gemini-2.5-flash NA new

Building on our previous analysis, we selected models based on their performance in terms of deliberative reasoning. We chose 4 top models,¹ whose consistency was reliably above chance, and 4 bottom models, whose consistency was reliably below chance.

2 Cases, surveys, and roles

Table 2.1: Cases
case survey N topic subtopic
1 CCPS ACT Deliberative ccps 31 climate climate
2 CSIRO WA energy_futures 17 climate energy
3 Winterthur zh_winterthur 16 climate climate

Table 2.2: Surveys
survey considerations policies scale_max q_method
1 ccps 33 7 11 FALSE
2 energy_futures 45 9 11 FALSE
3 zh_winterthur 30 6 7 FALSE

Table 2.3: Roles
uid type article role description
1 eco ideology an ecologist focuses on environmental protection and sustainability, advocating for societal change to ecological limits
2 coa perspective a coastal resident endures chronic flooding and salinization, forced to relocate due to rising sea levels and intense storms worsened by climate change
3 ctr perspective a construction worker suffers from extreme heat stress and lost work hours, perceiving climate change making outdoor labor unbearable and life-threatening
4 dis perspective a disease survivor recovers from dengue fever, aware that climate change’s rising temperatures are expanding the range of disease-carrying mosquitoes in their region
5 eld perspective an elderly urban resident endures intensified city heatwaves, struggling with disrupted services and feeling the direct, severe impact of climate change
6 far perspective a displaced family loses their home due to unprecedented wildfires, experiencing displacement and recognizing climate change as the major driver of the devastation
7 fis perspective a fisher notes his declining catches due to warming oceans, understanding that climate change is reorganizing marine life and reducing their traditional yield
8 lan perspective a landowner surveys his parched fields after a prolonged drought, feeling the compounding impacts of climate change that reduce crop yields and family income
9 par perspective a parent sees their child fall ill from a water-borne disease, attributing its spread to the increased heavy rainfall and warmer temperatures brought by climate change
10 sub perspective a subsistence farmer watches his crops wither under erratic rainfall patterns, and who sees these changes as direct consequence of climate change
11 vil perspective a villager faces dwindling, contaminated water supplies due to extended draughts and floods, aware that climate change is altering their water security
12 csk devils a climate skeptic prioritizes economic growth over CO2 emission cuts, fossil fuels over renewable energy, and does not believe in climate science

2.1 System prompt

We instructed LLMs to play each of the roles described above by including a system instruction in each request following the pattern:

Answer the following prompts as [article] [role], who [description].

For example:

Answer the following prompts as a climate skeptic, who prioritizes economic growth over CO2 emission cuts, fossil fuels over renewable energy, and does not believe in climate science.
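
The pattern above can be sketched as a small template function. This is a minimal illustration, not the code used in the study: the function name `build_system_prompt` and its parameter names are hypothetical, while the template text and the example values come from the roles table.

```python
def build_system_prompt(article: str, role: str, description: str) -> str:
    """Fill the pattern 'Answer the following prompts as [article] [role], who [description].'

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    return f"Answer the following prompts as {article} {role}, who {description}."


# Values taken from the "csk" row of the roles table.
prompt = build_system_prompt(
    article="a",
    role="climate skeptic",
    description=(
        "prioritizes economic growth over CO2 emission cuts, fossil fuels over "
        "renewable energy, and does not believe in climate science"
    ),
)
print(prompt)
```

Keeping the article as a separate field lets the same template produce both "a coastal resident" and "an ecologist" without string manipulation.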

3 Methods

3.1 Data collection

We collected 1,440 responses generated by 8 models across the 3 surveys and 12 roles described above, prompting each LLM 5 times with the same prompt (8 × 3 × 12 × 5 = 1,440).
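
The size of the data collection can be checked by enumerating the full design. The sketch below uses the model, survey, and role identifiers from the tables above; the variable names and the dictionary layout are assumptions for illustration only.

```python
from itertools import product

# Identifiers copied from the tables above.
MODELS = [
    "claude-3-7-sonnet-20250219", "claude-3-5-sonnet-20241022", "grok-3-beta",
    "claude-3-haiku-20240307", "command-r-08-2024", "gpt-3.5-turbo",
    "gpt-4o-mini", "gemini-2.5-flash",
]
SURVEYS = ["ccps", "energy_futures", "zh_winterthur"]
ROLES = ["eco", "coa", "ctr", "dis", "eld", "far", "fis", "lan", "par", "sub", "vil", "csk"]
REPEATS = 5  # each prompt is sent 5 times

# One entry per request: every model x survey x role combination, repeated 5 times.
requests = [
    {"model": m, "survey": s, "role": r, "iteration": i}
    for m, s, r, i in product(MODELS, SURVEYS, ROLES, range(1, REPEATS + 1))
]
print(len(requests))  # 1440
```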

3.2 Analysis

We calculated one Deliberative Reason Index (DRI) value per model/survey/role by treating each LLM response as one participant in a deliberation. The role “all” indicates that all roles were part of that deliberation (n = 60 participants, i.e., 5 participants for each of the 12 roles). DRI plots are shown in Figure 4.3.
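
This report does not spell out the DRI formula. The sketch below assumes one pairwise formulation. For every pair of participants, it computes the Spearman correlation of their consideration ratings and of their policy rankings, takes the pair's distance from the 1:1 diagonal, and rescales the mean distance. The scaling constant `lam`, all function names, and the toy data are assumptions for illustration and should be checked against the implementation actually used.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation (undefined if either vector is constant)."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def dri(considerations, policies):
    """Group-level DRI under the assumed formulation described above.

    considerations[i] and policies[i] are the responses of participant i.
    """
    lam = 1 - sqrt(2) / 2  # assumed scaling constant
    distances = []
    for a, b in combinations(range(len(considerations)), 2):
        r_c = spearman(considerations[a], considerations[b])
        r_p = spearman(policies[a], policies[b])
        distances.append(abs(r_c - r_p) / sqrt(2))  # distance from the diagonal
    return 2 * ((1 - mean(distances)) - lam) / (1 - lam) - 1


# Hypothetical toy data: 3 participants, 5 considerations, 3 policies.
considerations = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [5, 4, 3, 2, 1]]
policies = [[1, 2, 3], [1, 2, 3], [3, 2, 1]]
toy_dri = dri(considerations, policies)
print(round(toy_dri, 3))
```

In this toy case every pair agrees on considerations exactly as much as it agrees on policies, so all distances are zero and the sketch returns its maximum value of 1.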

4 Findings

4.1 Consistency

We compared top and bottom models in terms of the consistency of their DRI and Cronbach’s alpha (see top models in Figure 4.1 and bottom models in Figure 4.2).

4.1.1 Top models

Figure 4.1: Top models

We found that top LLMs are consistent across roles in terms of both DRI and Cronbach’s alpha (policies). The high DRI across roles (median = 0.637; IQR = 0.161) suggests that these LLMs tend to consistently align their considerations with their policy preferences. The high Cronbach’s alpha for their policy preferences (median = 0.784; IQR = 0.047) suggests that they tend to agree on the ranking of policy preferences.
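
For readers unfamiliar with Cronbach’s alpha, the standard formula is α = k/(k − 1) × (1 − Σ item variances / variance of totals), where k is the number of items. A minimal sketch follows. The report does not state which dimension is treated as the "items", so the orientation here (respondents as items, policies as cases), the function name, and the toy rankings are all assumptions.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for k item vectors, each holding one score per case."""
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    # Total score per case, summed across items.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical toy data: 3 respondents ("items") ranking 4 policies ("cases").
rankings = [
    [1, 2, 3, 4],
    [1, 3, 2, 4],
    [2, 1, 3, 4],
]
alpha = cronbach_alpha(rankings)
print(round(alpha, 3))
```

When all respondents give identical rankings the function returns 1; the more the rankings diverge, the lower the value.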

4.1.2 Bottom models

Figure 4.2: Bottom models

We also found that bottom LLMs are not consistent across roles in terms of DRI and are less consistent than top models in terms of Cronbach’s alpha (policies). The low DRI across roles (median = -0.177; IQR = 0.163) suggests that these LLMs tend to consistently misalign their considerations and policy preferences. Their Cronbach’s alpha for policy preferences (median = 0.635; IQR = 0.11) is lower than that of top models (median = 0.784), which suggests that they agree less on the ranking of policy preferences.

4.1.3 Summary for each model

4.1.3.1 DRI

Table 4.1: Mean DRI across models and roles
role claude-3-5-sonnet-20241022 claude-3-7-sonnet-20250219 claude-3-haiku-20240307 command-r-08-2024 gemini-2.5-flash gpt-3.5-turbo gpt-4o-mini grok-3-beta best_model
1 all 0.512 0.639 -0.291 -0.281 0.638 -0.213 0.000 0.625 claude-3-7-sonnet-20250219
2 coa 0.350 0.565 -0.526 -0.435 0.810 -0.315 -0.019 0.567 gemini-2.5-flash
3 csk 0.543 0.773 -0.118 -0.580 0.875 0.163 -0.153 0.795 gemini-2.5-flash
4 ctr 0.343 0.567 -0.368 -0.264 0.663 -0.129 0.252 0.447 gemini-2.5-flash
5 dis 0.476 0.538 -0.553 -0.490 0.569 -0.719 0.057 0.455 gemini-2.5-flash
6 eco 0.364 0.720 -0.281 -0.831 0.854 -0.472 0.084 0.696 gemini-2.5-flash
7 eld 0.404 0.498 -0.335 -0.396 0.796 -0.078 -0.322 0.626 gemini-2.5-flash
8 far 0.479 0.651 -0.524 -0.673 0.821 -0.388 -0.370 0.497 gemini-2.5-flash
9 fis 0.497 0.593 -0.492 -0.560 0.685 -0.665 -0.244 0.602 gemini-2.5-flash
10 lan 0.595 0.633 -0.318 -0.347 0.477 -0.466 0.199 0.587 claude-3-7-sonnet-20250219
11 par 0.498 0.708 -0.669 -0.472 0.598 -0.164 -0.284 0.670 claude-3-7-sonnet-20250219
12 sub 0.526 0.712 -0.433 -0.218 0.556 -0.106 -0.014 0.654 claude-3-7-sonnet-20250219
13 vil 0.581 0.604 -0.612 -0.550 0.407 -0.490 -0.252 0.613 grok-3-beta
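
The best_model column appears to be the model with the highest value in each row. The sketch below reproduces it for the "all" row of Table 4.1; the variable names are assumptions for illustration.

```python
# Mean DRI values from the "all" row of Table 4.1.
dri_all = {
    "claude-3-5-sonnet-20241022": 0.512,
    "claude-3-7-sonnet-20250219": 0.639,
    "claude-3-haiku-20240307": -0.291,
    "command-r-08-2024": -0.281,
    "gemini-2.5-flash": 0.638,
    "gpt-3.5-turbo": -0.213,
    "gpt-4o-mini": 0.000,
    "grok-3-beta": 0.625,
}

# Pick the model whose mean DRI is largest in this row.
best_model = max(dri_all, key=dri_all.get)
print(best_model)
```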

4.1.3.2 Cronbach’s Alpha (Policies)

Table 4.2: Mean alpha (policies) across models and roles
role claude-3-5-sonnet-20241022 claude-3-7-sonnet-20250219 claude-3-haiku-20240307 command-r-08-2024 gemini-2.5-flash gpt-3.5-turbo gpt-4o-mini grok-3-beta best_model
1 all 0.725 0.792 0.614 0.638 0.801 0.599 0.641 0.818 grok-3-beta
2 coa 0.713 0.745 0.816 0.808 0.771 0.737 0.763 0.807 claude-3-haiku-20240307
3 csk 0.783 0.802 0.813 0.708 0.848 0.764 0.715 0.851 grok-3-beta
4 ctr 0.749 0.791 0.774 0.776 0.918 0.787 0.727 0.755 gemini-2.5-flash
5 dis 0.761 0.772 0.669 0.802 0.771 0.762 0.756 0.796 command-r-08-2024
6 eco 0.764 0.844 0.711 0.730 0.814 0.800 0.759 0.716 claude-3-7-sonnet-20250219
7 eld 0.722 0.793 0.788 0.740 0.741 0.801 0.813 0.828 grok-3-beta
8 far 0.726 0.807 0.791 0.843 0.827 0.769 0.828 0.824 command-r-08-2024
9 fis 0.787 0.792 0.690 0.793 0.829 0.750 0.825 0.704 gemini-2.5-flash
10 lan 0.715 0.792 0.802 0.805 0.789 0.783 0.795 0.792 command-r-08-2024
11 par 0.785 0.704 0.774 0.777 0.790 0.778 0.762 0.833 grok-3-beta
12 sub 0.841 0.800 0.671 0.754 0.761 0.760 0.803 0.839 claude-3-5-sonnet-20241022
13 vil 0.708 0.818 0.770 0.794 0.808 0.786 0.798 0.662 claude-3-7-sonnet-20250219

4.1.3.3 Cronbach’s Alpha (Considerations)

Table 4.3: Mean alpha (considerations) across models and roles
role claude-3-5-sonnet-20241022 claude-3-7-sonnet-20250219 claude-3-haiku-20240307 command-r-08-2024 gemini-2.5-flash gpt-3.5-turbo gpt-4o-mini grok-3-beta best_model
1 all 0.990 0.990 0.976 0.975 0.984 0.911 0.976 0.987 claude-3-5-sonnet-20241022
2 coa 0.863 0.918 0.880 0.787 0.849 0.886 0.837 0.891 claude-3-7-sonnet-20250219
3 csk 0.769 0.856 0.898 0.767 0.551 0.952 0.817 0.831 gpt-3.5-turbo
4 ctr 0.916 0.909 0.872 0.915 0.852 0.916 0.852 0.906 claude-3-5-sonnet-20241022
5 dis 0.905 0.921 0.894 0.904 0.859 0.918 0.876 0.896 claude-3-7-sonnet-20250219
6 eco 0.900 0.860 0.884 0.827 0.842 0.865 0.871 0.863 claude-3-5-sonnet-20241022
7 eld 0.917 0.899 0.919 0.886 0.917 0.911 0.879 0.903 claude-3-haiku-20240307
8 far 0.905 0.848 0.919 0.747 0.815 0.774 0.860 0.905 claude-3-haiku-20240307
9 fis 0.916 0.895 0.894 0.907 0.896 0.918 0.891 0.905 gpt-3.5-turbo
10 lan 0.917 0.914 0.884 0.904 0.884 0.885 0.909 0.917 claude-3-5-sonnet-20241022
11 par 0.925 0.905 0.863 0.867 0.830 0.888 0.885 0.922 claude-3-5-sonnet-20241022
12 sub 0.902 0.919 0.895 0.758 0.851 0.889 0.906 0.911 claude-3-7-sonnet-20250219
13 vil 0.881 0.880 0.914 0.901 0.873 0.927 0.895 0.887 gpt-3.5-turbo

4.2 Model/Survey DRI Plots

These plots show a simulated deliberation across all 12 roles for each survey and model. Each simulated deliberation has 60 participants (12 roles with 5 participants each).

Note that bottom models are visually inconsistent.

Figure 4.3: DRI Plots

4.3 Survey/Role DRI Plots

These plots show a simulated deliberation across all models in the same class (i.e., top, bottom) for each role and survey. Each simulated deliberation has 20 participants (4 models with 5 participants each).

Note that top models are visually more consistent than bottom models.

4.3.1 Top models

4.3.2 Bottom models

4.4 Permutation tests

We conducted permutation tests with 10,000 iterations to check which models and which roles show consistency that is unlikely to be due to chance.
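
The general shape of such a test can be sketched as follows. This is a generic one-sided permutation test, not the procedure used here: the report does not state what is shuffled or how DRI is recomputed, so the shuffling scheme, the stand-in statistic (a Pearson correlation instead of DRI), the function names, and the toy data are all assumptions.

```python
import random
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Stand-in statistic for illustration; the report uses DRI instead."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def permutation_test(x, y, statistic, n_iter=10_000, seed=42):
    """Return the observed statistic and the share of permutations at least as large."""
    rng = random.Random(seed)
    observed = statistic(x, y)
    shuffled = list(y)
    at_least_as_large = 0
    for _ in range(n_iter):
        rng.shuffle(shuffled)  # break the pairing between x and y
        if statistic(x, shuffled) >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / n_iter


# Hypothetical toy data with a strong positive association.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]
observed, p = permutation_test(x, y, pearson, n_iter=2_000)
print(round(observed, 3), p)
```

A small p means that few shuffled data sets reach the observed value, which is the reading applied to the p column in the summary tables below.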

4.4.1 Models and Surveys: Which models are truly consistent across roles?

In this permutation test, we estimate, for each survey and model, how likely it is that the observed consistency across roles, measured by DRI, is due to chance.

Figure 4.4: Survey/Model Permutation Test

Table 4.4: Survey/Model Permutation Summary
obs_dri p n min max median iqr mean sd se ci survey model
0.417 0.000 10000 -0.295 0.278 -0.235 0.113 -0.200 0.075 0.001 0.001 ccps claude-3-5-sonnet-20241022
0.676 0.000 10000 -0.163 0.362 -0.118 0.136 -0.073 0.084 0.001 0.002 ccps claude-3-7-sonnet-20250219
0.427 0.000 10000 -0.301 0.294 -0.257 0.115 -0.220 0.072 0.001 0.001 ccps grok-3-beta
0.711 0.000 10000 -0.110 0.406 -0.061 0.130 -0.016 0.083 0.001 0.002 ccps gemini-2.5-flash
0.123 0.000 10000 -0.374 0.122 -0.269 0.098 -0.260 0.072 0.001 0.001 ccps gpt-4o-mini
-0.171 0.000 10000 -0.515 -0.215 -0.422 0.067 -0.415 0.049 0.000 0.001 ccps claude-3-haiku-20240307
0.497 0.000 10000 -0.299 0.307 -0.227 0.116 -0.195 0.076 0.001 0.001 energy_futures claude-3-5-sonnet-20241022
0.591 0.000 10000 -0.212 0.269 -0.155 0.124 -0.119 0.081 0.001 0.002 energy_futures claude-3-7-sonnet-20250219
0.691 0.000 10000 -0.142 0.367 -0.087 0.130 -0.050 0.083 0.001 0.002 energy_futures grok-3-beta
0.527 0.000 10000 -0.243 0.302 -0.153 0.113 -0.133 0.078 0.001 0.002 energy_futures gemini-2.5-flash
-0.010 0.000 10000 -0.266 -0.027 -0.191 0.046 -0.187 0.034 0.000 0.001 energy_futures gpt-4o-mini
-0.137 0.000 10000 -0.229 -0.144 -0.190 0.015 -0.189 0.011 0.000 0.000 energy_futures gpt-3.5-turbo
0.624 0.000 10000 -0.161 0.468 -0.119 0.125 -0.075 0.079 0.001 0.002 zh_winterthur claude-3-5-sonnet-20241022
0.649 0.000 10000 -0.132 0.468 -0.082 0.123 -0.041 0.077 0.001 0.002 zh_winterthur claude-3-7-sonnet-20250219
0.759 0.000 10000 -0.028 0.564 0.009 0.126 0.052 0.078 0.001 0.002 zh_winterthur grok-3-beta
0.677 0.000 10000 -0.188 0.317 -0.138 0.130 -0.094 0.082 0.001 0.002 zh_winterthur gemini-2.5-flash
-0.112 0.000 10000 -0.559 -0.155 -0.460 0.075 -0.453 0.055 0.001 0.001 zh_winterthur gpt-4o-mini
-0.284 0.000 10000 -0.427 -0.291 -0.368 0.030 -0.367 0.021 0.000 0.000 zh_winterthur gpt-3.5-turbo
-0.182 0.000 10000 -0.453 -0.171 -0.357 0.054 -0.354 0.040 0.000 0.001 energy_futures claude-3-haiku-20240307
-0.238 0.000 10000 -0.579 -0.236 -0.477 0.063 -0.473 0.046 0.000 0.001 ccps command-r-08-2024
0.075 0.001 10000 -0.091 0.127 -0.034 0.036 -0.030 0.027 0.000 0.001 energy_futures command-r-08-2024
-0.520 0.045 10000 -0.705 -0.444 -0.603 0.061 -0.601 0.044 0.000 0.001 zh_winterthur claude-3-haiku-20240307
-0.680 0.133 10000 -0.795 -0.580 -0.716 0.040 -0.713 0.030 0.000 0.001 zh_winterthur command-r-08-2024
-0.218 0.314 10000 -0.298 -0.098 -0.234 0.042 -0.231 0.030 0.000 0.001 ccps gpt-3.5-turbo

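Most models seem to be consistent across roles. In 22 of the 24 survey/model combinations, p is below 0.05: fewer than 5% of the 10,000 permutations led to a higher DRI than the observed DRI, suggesting that the observed value is unlikely to be due to chance. The two exceptions are command-r-08-2024 on zh_winterthur (p = 0.133) and gpt-3.5-turbo on ccps (p = 0.314).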

Note that this permutation test took 35.9 minutes to complete.

4.4.2 Surveys and Roles: Are models truly consistent across roles?

In this permutation test, we estimate, for each survey and role, how likely it is that the observed consistency, measured by DRI, is due to chance.

Figure 4.5: Survey/Role Permutation Test

Table 4.5: Survey/Role Permutation Summary
obs_dri p n min max median iqr mean sd se ci survey role
0.045 0.000 10000 0.004 0.040 0.023 0.010 0.023 0.006 0.000 0.000 ccps eco
0.231 0.000 10000 0.183 0.229 0.205 0.010 0.206 0.007 0.000 0.000 ccps lan
-0.041 0.000 10000 -0.146 -0.058 -0.106 0.017 -0.105 0.013 0.000 0.000 energy_futures eco
-0.006 0.000 10000 -0.105 -0.021 -0.059 0.018 -0.059 0.013 0.000 0.000 energy_futures coa
-0.031 0.000 10000 -0.098 -0.040 -0.069 0.012 -0.069 0.008 0.000 0.000 energy_futures dis
0.000 0.000 10000 -0.112 -0.011 -0.059 0.019 -0.059 0.014 0.000 0.000 energy_futures far
0.085 0.000 10000 0.008 0.071 0.038 0.013 0.038 0.009 0.000 0.000 energy_futures fis
0.160 0.000 10000 0.061 0.152 0.105 0.019 0.105 0.014 0.000 0.000 energy_futures par
0.215 0.000 10000 0.056 0.209 0.129 0.032 0.130 0.023 0.000 0.000 energy_futures sub
0.029 0.000 10000 -0.078 0.019 -0.027 0.019 -0.028 0.014 0.000 0.000 energy_futures vil
0.287 0.000 10000 -0.115 0.245 0.021 0.066 0.023 0.048 0.000 0.001 energy_futures csk
-0.047 0.000 10000 -0.066 -0.049 -0.058 0.004 -0.058 0.003 0.000 0.000 zh_winterthur eco
-0.009 0.000 10000 -0.113 -0.010 -0.067 0.023 -0.066 0.016 0.000 0.000 zh_winterthur ctr
-0.221 0.000 10000 -0.253 -0.224 -0.238 0.007 -0.238 0.005 0.000 0.000 zh_winterthur par
0.158 0.000 10000 0.107 0.158 0.133 0.012 0.133 0.008 0.000 0.000 ccps coa
0.293 0.000 10000 0.170 0.295 0.222 0.029 0.223 0.020 0.000 0.000 ccps ctr
-0.131 0.000 10000 -0.186 -0.130 -0.156 0.013 -0.156 0.009 0.000 0.000 zh_winterthur far
-0.093 0.001 10000 -0.138 -0.090 -0.115 0.012 -0.115 0.008 0.000 0.000 zh_winterthur coa
-0.128 0.001 10000 -0.209 -0.123 -0.166 0.020 -0.166 0.014 0.000 0.000 zh_winterthur sub
-0.045 0.001 10000 -0.067 -0.042 -0.055 0.005 -0.055 0.004 0.000 0.000 ccps fis
0.057 0.002 10000 -0.095 0.086 -0.030 0.035 -0.028 0.026 0.000 0.001 energy_futures ctr
-0.278 0.004 10000 -0.313 -0.273 -0.292 0.008 -0.292 0.006 0.000 0.000 zh_winterthur vil
0.121 0.005 10000 0.070 0.127 0.099 0.012 0.099 0.009 0.000 0.000 ccps far
0.092 0.006 10000 -0.020 0.115 0.041 0.027 0.042 0.019 0.000 0.000 energy_futures eld
0.129 0.006 10000 0.105 0.132 0.119 0.005 0.119 0.004 0.000 0.000 ccps dis
0.035 0.007 10000 -0.034 0.045 0.008 0.017 0.008 0.012 0.000 0.000 ccps sub
0.144 0.007 10000 -0.142 0.223 -0.005 0.074 -0.001 0.054 0.001 0.001 ccps csk
-0.185 0.010 10000 -0.210 -0.182 -0.195 0.006 -0.195 0.004 0.000 0.000 zh_winterthur fis
0.114 0.010 10000 0.088 0.119 0.104 0.006 0.104 0.004 0.000 0.000 ccps par
0.194 0.011 10000 -0.081 0.269 0.036 0.079 0.042 0.057 0.001 0.001 zh_winterthur csk
0.133 0.023 10000 0.090 0.144 0.117 0.011 0.117 0.008 0.000 0.000 zh_winterthur lan
0.089 0.024 10000 0.021 0.105 0.062 0.019 0.062 0.013 0.000 0.000 zh_winterthur eld
0.048 0.054 10000 0.018 0.057 0.039 0.008 0.039 0.006 0.000 0.000 ccps vil
0.118 0.063 10000 0.067 0.136 0.101 0.015 0.101 0.011 0.000 0.000 ccps eld
-0.231 0.095 10000 -0.248 -0.225 -0.236 0.004 -0.236 0.003 0.000 0.000 zh_winterthur dis
-0.003 0.191 10000 -0.095 0.090 -0.035 0.040 -0.029 0.030 0.000 0.001 energy_futures lan

Note that this permutation test took 51.4 minutes to complete.

5 References


  1. Note that gemini-2.5-pro-preview-03-25 was replaced by gemini-2.5-pro. However, this version of the model became significantly slower and more expensive because it has “thinking” enabled by default, and this setting cannot be turned off. As a result, we decided to use the flash version (gemini-2.5-flash), a lighter and cheaper alternative.